The PCA results are actually very similar if you use a CPM cutoff of 1 or 5, so I am sticking with 1 for this analysis, because it will give us loadings for more genes.
The percentage of variance explained by each principal component. Based on this, I think we should keep 5-12 PCs for downstream analysis, but there’s no way to visualize that.
Since we don’t have a huge number of samples, I don’t think we can get much out of this type of analysis. I would treat these as purely exploratory. When we do differential expression testing, then it’s fine to use smaller subsets (8 v. 8 is still pretty robust, in that case), but I wouldn’t read too much into the PCA plots.
Subset 1: All controls
ind <-which(pheno$group %in%c('SC','EC'))subcpm <- cpm[,ind]filt <-which(rowMeans(subcpm) >1)pca <- BiocSingular::runPCA(t(log1p(subcpm[filt, ])), rank=length(ind))#plot(pca$sdev^2/sum(pca$sdev^2))plot_ly(data.frame(pca$x), x=~PC1, y=~PC2, z=~PC3, color = pheno$color[ind], colors =sort(unique(pheno$color[ind])))
Subset 2: All post-bypass samples
ind <-which(pheno$timepoint %in%c('B','C'))subcpm <- cpm[,ind]filt <-which(rowMeans(subcpm) >1)pca <- BiocSingular::runPCA(t(log1p(subcpm[filt, ])), rank=length(ind))#plot(pca$sdev^2/sum(pca$sdev^2))plot_ly(data.frame(pca$x), x=~PC1, y=~PC2, z=~PC3, color = pheno$color[ind], colors =sort(unique(pheno$color[ind])))